In the last three decades a large number of compiler transformations for optimizing
programs have been implemented. Most optimizations for uniprocessors reduce the number 
of instructions executed by the program using transformations based on the analysis of 
scalar quantities and data-flow techniques. In contrast, optimizations for high-performance 
superscalar, vector, and parallel processors maximize parallelism and memory locality with 
transformations that rely on tracking the properties o ... 

Keywords: compilation, dependence analysis, locality, multiprocessors, optimization, 
parallelism, superscalar processors, vectorization 
Software pipelining is an aggressive scheduling technique that generates efficient code for
loops and is particularly effective for VLIW architectures. Few software pipelining algorithms, 
however, are able to efficiently schedule loops that contain conditional branches. We have 
developed an algorithm we call All Paths Pipelining (APP) that addresses this shortcoming of 
software pipelining. APP is designed to achieve optimal or near-optimal performance for any 
run of iterations while providing ef ... 
April 1992 ACM SIGARCH Computer Architecture News, Proceedings of the 19th annual international symposium on Computer architecture, volume 20 issue 2
This report describes the results of studies of the WM architecture: its performance, the
values of some of its key architectural parameters, the difficulty of compiling for it, and 
hardware implementation complexity. The studies confirm that, with comparable chip area 
and without heroic compiler technology, WM is capable of outperforming traditional scalar 
architectures by factors of 2-9. They also underscore the need to devise higher bandwidth 
memory systems. 
Hardware multithreading is becoming a generally applied technique in the next generation of
microprocessors. Several multithreaded processors have been announced by industry or are already
in production in the areas of high-performance microprocessors, media, and network
processors. A multithreaded processor is able to pursue two or more threads of control in 
parallel within the processor pipeline. The contexts of two or more threads of control are 
often stored in separate on-chip register sets. Unused i ... 
Increasingly wider superscalar processors are experiencing diminishing performance returns
while requiring larger portions of die area dedicated to control rather than datapath. As an 
alternative to using these processors to exploit parallelism effectively, we are investigating 
the viability of using single-chip vector microprocessors. This paper presents some initial 
results of our investigation where we compare the performance and cost of vector 
microprocessors to that of aggressive, out-of-or ... 
In this paper, we propose a multithreaded processor architecture which improves machine
throughput. In our processor architecture, instructions from different threads (not a single
thread) are issued simultaneously to multiple functional units, and these instructions can
begin execution unless there are functional unit conflicts. This parallel execution scheme 
greatly improves the utilization of the functional unit. Simulation results show that by 
executing two and four threads in parallel ... 
HPSm is a single-chip microarchitecture designed and implemented at the University of
California to achieve high performance. The approach is to exploit both vertical and 
horizontal concurrency in the microarchitecture. Experiments have been conducted to 
demonstrate the effectiveness of HPSm as compared to a popular single-chip 
microarchitecture, the Berkeley RISC/SPUR. Evaluations have been done with both control 
intensive and floating point intensive benchmarks. For both types of benchmar ... 
This paper describes the architectural and organizational tradeoffs made during the design
of the MultiTitan, and provides data supporting the decisions made. These decisions covered
the entire space of processor design, from the instruction set and virtual memory 
architecture through the pipeline and organization of the machine. In particular, some of the 
tradeoffs involved the use of an on-chip instruction cache with off-chip TLB and floating- 
point unit, the use of direct-mapped instead o ... 
Multiscalar processors use a new, aggressive implementation paradigm for extracting large
quantities of instruction level parallelism from ordinary high level language programs. A 
single program is divided into a collection of tasks by a combination of software and 
hardware. The tasks are distributed to a number of parallel processing units which reside 
within a processor complex. Each of these units fetches and executes instructions belonging 
to its assigned task. The appearance of a single log ... 
KCM (Knowledge Crunching Machine) is a high-performance back-end processor which,
coupled to a UNIX* desk-top workstation, provides a powerful and user-friendly Prolog 
environment catering for both development and execution of significant Prolog applications. 
This paper gives a general overview of the architecture of KCM stressing some new features 
like a 64-bit tagged architecture, shallow backtracking and an original memory management 
unit. Some early benchmark result ... 
In conventional superscalar microarchitectures with partitioned integer and floating-point
resources, all floating-point resources are idle during execution of integer programs. 
Palacharla and Smith [26] addressed this drawback and proposed that the floating-point 
subsystem be augmented to support integer operations. The hardware changes required are 
expected to be fairly minimal. To exploit these idle floating-point resources, the compiler must
identify integer code that can be profitably offloaded to ... 
This paper examines simultaneous multithreading, a technique permitting several
independent threads to issue instructions to a superscalar's multiple functional units in a 
single cycle. We present several models of simultaneous multithreading and compare them 
with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and 
single-chip, multiple-issue multiprocessing architectures. Our results show that both (single- 
threaded) superscalar and fine-grain multithr ... 
The IA-64, Intel's 64-bit instruction set architecture, exhibits a number of interesting
architectural features. Here we consider those features as they relate to supporting garbage 
collection (GC). We aim to assist GC and compiler implementors by describing how one may 
exploit features of the IA-64. Along the way, we record some previously unpublished object 
scanning techniques, and offer novel ones for object allocation (suggesting some simple 
operating system support that would simplify it ... 
To achieve high performance, contemporary computer systems rely on two forms of
parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue 
super-scalar processors exploit ILP by executing multiple instructions from a single program 
in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel 
on different processors. Unfortunately, both parallel processing styles statically partition 
processor resources, thus preventing t ... 

Keywords: cache interference, instruction-level parallelism, multiprocessors, 
multithreading, simultaneous multithreading, thread-level parallelism 
Dynamic superscalar processors execute multiple instructions out-of-order by looking for
independent operations within a large window. The number of physical registers within the 
processor has a direct impact on the size of this window as most in-flight instructions 
require a new physical register at dispatch. A large multi-ported register file helps improve 
the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, 
especially in future wire-limited technologies. ... 